Skip to content

docs(proposal): Improve extensibility of registering power meter backends#2467

Open
nikimanoledaki wants to merge 2 commits into
mainfrom
nm/draft-extensible-power-meter-registry
Open

docs(proposal): Improve extensibility of registering power meter backends#2467
nikimanoledaki wants to merge 2 commits into
mainfrom
nm/draft-extensible-power-meter-registry

Conversation

@nikimanoledaki
Copy link
Copy Markdown
Collaborator

@nikimanoledaki nikimanoledaki commented May 3, 2026

Why

cmd/kepler/main.go constructs CPU and GPU power meters in two different ways:

  • createCPUMeter is a strict RAPL-with-hwmon-fallback chain.
  • createGPUMeters is a soft fan-out across registered vendors.

Three problems follow: asymmetric patterns, hidden GPU init failures, and a strict CPU path that operators cannot extend easily.

What

This proposal:

  • Promotes the unexported powerMeter interface in internal/device/power_meter.go to an exported PowerMeter (Name, Init). Shutdown is optional via service.Shutdowner.
  • Adds cpu.meters config (default ["rapl", "hwmon"]) that operators can select.
  • Replaces the hardcoded fallback chain with a switch dispatch: device.CreateCPUMeter walks the priority list, buildCPUMeter constructs each backend in a switch over backend names.
  • Splits gpu.Discover into three failure modes so real backend failures surface as errors rather than log lines (separable follow-up PR).
  • Translates two legacy keys (experimental.hwmon.forceEnabled, dev.fake-cpu-meter.enabled) to cpu.meters at startup with a deprecation warning.

AI tooling disclosure: I used Claude Code to plan and generate some of these ideas. I prompted and heavily edited any AI-generated text and guarantee that I have read, reviewed, and edited every line.

@github-actions github-actions Bot added the docs Documentation changes label May 3, 2026
@nikimanoledaki nikimanoledaki changed the title docs(proposal): add draft for extensible power meter registry docs(proposal): Make power meter registration easier to extend May 3, 2026
@nikimanoledaki nikimanoledaki force-pushed the nm/draft-extensible-power-meter-registry branch 2 times, most recently from 072edab to 8acc9f0 Compare May 3, 2026 14:19
@nikimanoledaki nikimanoledaki marked this pull request as ready for review May 3, 2026 14:19
@nikimanoledaki nikimanoledaki changed the title docs(proposal): Make power meter registration easier to extend docs(proposal): Make registration power meter device easier to extend May 3, 2026
@nikimanoledaki nikimanoledaki changed the title docs(proposal): Make registration power meter device easier to extend docs(proposal): Make registering power meter device more extensible May 3, 2026
@nikimanoledaki nikimanoledaki changed the title docs(proposal): Make registering power meter device more extensible docs(proposal): Improve extensibility of registering power meter backends May 3, 2026
@nikimanoledaki nikimanoledaki force-pushed the nm/draft-extensible-power-meter-registry branch from 8acc9f0 to 51fea8e Compare May 3, 2026 18:26
@nikimanoledaki nikimanoledaki force-pushed the nm/draft-extensible-power-meter-registry branch 2 times, most recently from c20aaca to e977202 Compare May 3, 2026 21:31
Propose a `device.PowerMeter` interface and per-domain registry to unify
CPU and GPU device discovery in `cmd/kepler/main.go`. Opens the CPU path
to non-RAPL backends (MSR, BMC-derived, external probes) without forking,
matching the registry pattern GPU already uses.

Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com>
@nikimanoledaki nikimanoledaki force-pushed the nm/draft-extensible-power-meter-registry branch from e977202 to 8e887ef Compare May 3, 2026 21:32
- `markdown-table-format`: align column widths in the GPU result-per-scenario table.
- `markdownlint MD024`: rename duplicate `### GPU failure-mode split` heading
  in the Next Steps section to `### GPU failure-mode split (separate PR)`
  to disambiguate from the same heading in Proposed Solution.

Signed-off-by: nikimanoledaki <niki.manoledaki@grafana.com>
@github-actions github-actions Bot added the fix A bug fix label May 3, 2026
Comment on lines +173 to +176
```yaml
cpu:
meters: ["rapl", "hwmon"] # ordered priority
```
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the name cpu.meters doesnt convey that it is a ordered fallback chain. could we use cpu.meterPriority or cpu.fallbackChain ?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with fallbackChain field.
And, in addition to log, Kepler resource of kepler-operator can have status to present the active (selected) meter.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, I can rename this field 👍

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look: #2471

Once we agree on these assumptions, here are the problems that follow.

1. **Asymmetry.** Two patterns for the same job (turn config into hardware abstractions) makes `cmd/kepler/main.go` harder to read and audit.
2. **Hidden GPU failures.** `gpu.Discover` (`internal/device/gpu/registry.go:48`) collapses three failure modes into one `nil` return:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are some prerequisites for gpu initialization.

  • the node must have a gpu
  • node feature discovery operator must discover gpu and add some labels to node
  • nvidia (or AMD) gpu operator must be installed and some CRs created.
  • wait for drivers to be installed and initialized etc. this ensures nvml is present
  • now kepler's gpu discovery should succeed

if kepler inits gpu before all the steps are completed, gpu init will fail.
if gpu is enabled, and gpu init fails, kepler cannot separate the two case

  • gpu is not present on the node
  • gpu is present but cannot initialize
    and a cluster can be heterogeneous, i.e onlt few nodes may have gpu

so there was a deliberate decision to allow cpu power to continue to work if gpu side fails for any reason.
@vprashar2929 do you remember same way or did i miss something?

this proposal will fail kepler startup if gpu is configured and gpu init fails, right?

may be there are ways to detect gpu's presence on the node, and if gpu is present, and if gpu init fails we treat it as kepler failure. but the method to detect gpu presence should work in VM nodes also.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vimalk78 Yes this is exact flow.

so there was a deliberate decision to allow cpu power to continue to work if gpu side fails for any reason.

IMO CPU is the base/primary power source for Kepler, and GPU is supplementary. Every node has a CPU, so Kepler should always be able to report CPU power. GPU power only makes sense once that whole prerequisite chain has actually converged on the node. In a heterogeneous cluster tying Kepler's startup to a successful GPU init would mean blocking CPU power monitoring on perfectly healthy nodes just because the GPU side hasn't caught up yet.

```text
internal/device/
├── power_meter.go # PowerMeter interface (promoted from private)
├── cpu_power_meter.go # CPUPowerMeter, CreateCPUMeter, buildCPUMeter
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may have folder for cpu (same as what we have for gpu)

Copy link
Copy Markdown
Collaborator Author

@nikimanoledaki nikimanoledaki May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! I agree. I plan to do this in a follow-up PR 👍

I can update the proposal to reflect this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

docs Documentation changes fix A bug fix

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants